This work analyzes how several physicochemical parameters of red wine affect its quality. The dataset is curated by Udacity, comes from the UCI repository https://archive.ics.uci.edu/ml/datasets/Wine+Quality and consists of 1599 red wine samples (https://docs.google.com/document/d/1qEcwltBMlRYZT-l699-71TzInWfk4W9q5rTCSvDVMpc/pub?embedded=true).
Cortez et al. (2009) show that the most important features for assessing red wine quality are:
sulphates
pH
total sulfur dioxide
The dataset has 1599 wine entries and 14 columns: the row id X, 11 physicochemical features, quality and the derived quality.cat.
The summary of the dataset is the following:
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality quality.cat
## Min. : 8.40 Min. :3.000 bad : 63
## 1st Qu.: 9.50 1st Qu.:5.000 medium:1319
## Median :10.20 Median :6.000 good : 217
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
X: the row id.
quality: ranges between 3 and 8.
quality.cat: generated to group quality into ranges: 0-4 -> BAD, 4-7 -> MEDIUM, 7-10 -> GOOD. More information in the section “Analysis”.
This section analyzes each of the variables describing the wines.
The quality distribution shows that most wines score 5 or 6 points. The mean is 5.64 and the median is 6.
Figure 3.1: Quality distribution
Figure 3.2 shows the distributions of fixed acidity, volatile acidity and citric acid.
Fixed acidity is positively skewed, with mean 8.32 and median 7.9: the mean exceeds the median and the tail extends up to 15.9.
Volatile acidity is positively skewed with mean 0.53 and median 0.52.
Citric acid appears to have a bimodal or even trimodal distribution, with overall mean 0.27 and median 0.26.
Figure 3.2: Fixed acidity, volatile acidity and citric acid distribution
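The direction of skew can also be checked numerically rather than by eye. The following is a minimal base R sketch on a small made-up sample (the values and the helper name `skewness` are assumptions for illustration, not part of the original analysis). Comparing mean and median is only a rule of thumb, so the moment-based coefficient is computed as well.

```r
# Sample skewness from the second and third central moments (base R only).
# `skewness` is a hypothetical helper defined here, not a base R function.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Hypothetical right-tailed sample, not taken from the wine data.
x <- c(7.1, 7.4, 7.8, 7.9, 8.0, 8.3, 9.2, 10.5, 12.8, 15.9)

mean(x) > median(x)  # TRUE when the long right tail pulls the mean upwards
skewness(x) > 0      # TRUE for a positively skewed sample
```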
Figure 3.3: Residual sugar distribution
Figure 3.4 shows the distributions of chlorides, free sulfur dioxide and total sulfur dioxide levels.
The distribution of chlorides is positively skewed with mean 0.087 and median 0.079. Outliers more than 3 standard deviations from the mean were removed, as in the case of residual sugar.
The free sulfur dioxide is positively skewed with mean 16 and median 14.
The total sulfur dioxide is positively skewed with mean 46 and median 38.
Figure 3.4: Chlorides and sulfur dioxide distributions
Figure 3.5 shows the distributions of density, pH, sulphates and alcohol.
The distribution of density appears to be approximately normal and symmetric, with mean 0.9967 and median 0.9968.
The pH distribution appears roughly symmetric, with mean 3.31 and median 3.31.
The sulphates distribution is positively skewed with mean 0.66 and median 0.62.
Figure 3.5: Density, pH, sulphates and alcohol distributions
There are 1599 red wines with 12 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality). The variable quality is converted into a factor variable (adding a new variable named quality.cat) with the following levels:
| | BAD | MEDIUM | GOOD |
|---|---|---|---|
| quality.cat | [0,4] | (4,7) | [7,10] |
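A grouping like the one listed above could be built with `cut()`. This is a minimal sketch on made-up scores, not the code used for the report; the data frame `redwines` and its values are assumptions. Since quality scores are integers, the breaks (0,4], (4,6], (6,10] produce the same groups as the intervals [0,4], (4,7), [7,10].

```r
# Hypothetical toy scores standing in for the real quality column.
redwines <- data.frame(quality = c(3, 4, 5, 5, 6, 6, 7, 8))

# cut() uses right-closed intervals by default: (0,4], (4,6], (6,10].
redwines$quality.cat <- cut(
  redwines$quality,
  breaks = c(0, 4, 6, 10),
  labels = c("bad", "medium", "good")
)

# Number of wines per level.
table(redwines$quality.cat)
```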
Other observations:
| | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| redwines.quality | 3 | 5 | 6 | 5.636 | 6 | 8 |
Table 3.1 shows that the mean quality of the red wines is 5.636 and the median is 6. The first quartile is 5 and the third quartile is 6; hence at least 50% of the data lies within the 5-6 quality range, which is the MEDIUM level (1319 of the 1599 wines according to the dataset summary).
The main feature of interest in this dataset is the quality of the wine.
Cortez et al. (2009) show that the most important features for assessing red wine quality are sulphates, pH and total sulfur dioxide. An analysis will be performed to verify whether these are the most important features or whether others are.
Apart from quality.cat, I did not create any new variables to support the analysis, since the amount of information available is enough to assess the quality of the wine.
In the histograms, some of the variables, like residual sugar or chlorides, show long tails, which indicates the presence of outliers. For a better view, I tried removing the values with \(|x - \mu| > 3\sigma\).
Figure 3.6: Residual sugar and Chlorides distributions with original and log10() transformed axis
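The three-standard-deviation rule mentioned above can be sketched in base R as follows. The helper `remove_outliers` and the sample values are hypothetical, introduced only for illustration; the report's own filtering code is not shown in the source.

```r
# Keep only the values within k standard deviations of the mean.
# `remove_outliers` is a hypothetical helper, not a base R function.
remove_outliers <- function(x, k = 3) {
  x[abs(x - mean(x)) <= k * sd(x)]
}

# Hypothetical residual sugar values with one extreme observation.
sugar <- c(rep(c(1.9, 2.0, 2.2, 2.4, 2.6), times = 6), 15.5)

length(sugar)                   # 31 values before filtering
length(remove_outliers(sugar))  # 30 values after filtering: 15.5 is dropped
```

Note that the mean and standard deviation are themselves affected by the outliers, so in very small samples an extreme value may not exceed the threshold at all.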
Most of the features had an approximately normal distribution. Some of them had quite skewed distributions and many outliers, so a log10 transformation can be useful to get a better view. The effect of this transformation can be observed in figure 3.7, where characteristics of the plots on the left side are seen better after the x axis is transformed.
Figure 3.7: Residual sugar and Chlorides distributions with original and log10() transformed axis
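To illustrate why a log10 axis helps with long-tailed variables, here is a small sketch under stated assumptions: the sample is made of evenly spaced quantiles of a log-normal distribution (a stand-in chosen for illustration, not the wine data), and `skewness` is a hypothetical helper.

```r
# Sample skewness from central moments (hypothetical helper, base R only).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Hypothetical long-tailed sample: evenly spaced log-normal quantiles.
x <- qlnorm(ppoints(500), meanlog = log(0.08), sdlog = 0.4)

skewness(x)         # clearly positive: long right tail on the original scale
skewness(log10(x))  # close to zero: roughly symmetric on the log10 scale
```

When plotting with ggplot2, the same effect on the axis is obtained with `scale_x_log10()` instead of transforming the data beforehand.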
A good way to see how two variables are related is the correlation coefficient. This section calculates the correlation between the features of red wine and analyzes the strongest ones.
The next figure (fig. 4.1) shows the correlation coefficient between the different variables in the Red Wines dataset. Significant correlation values (\(|\rho|>0.3\)) are highlighted, with a stronger color density the higher the correlation value.
For the purpose of this analysis, the focus will be placed on the relationship between quality and the other variables.
Figure 4.1: Correlation between Red Wine’s dataset features
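A correlation matrix with the \(|\rho|>0.3\) threshold could be computed along these lines. This is a sketch on a toy data frame: the column names mirror the dataset, but the values are made up and the object names `wines`, `rho` and `significant` are assumptions.

```r
# Hypothetical toy data; column names mirror the dataset, values do not.
wines <- data.frame(
  fixed.acidity = c(7.4, 7.8, 11.2, 7.9, 6.7, 10.0, 8.5, 5.6),
  pH            = c(3.51, 3.20, 3.16, 3.30, 3.52, 3.10, 3.28, 3.60),
  alcohol       = c(9.4, 9.8, 9.8, 10.5, 9.2, 12.0, 10.0, 11.5),
  quality       = c(5, 5, 6, 6, 5, 7, 5, 7)
)

# Pearson correlation between every pair of columns.
rho <- cor(wines)

# Blank out the pairs at or below the threshold used in the text.
significant <- rho
significant[abs(rho) <= 0.3] <- NA
round(significant, 2)

# Absolute correlations with quality, strongest first.
sort(abs(rho["quality", colnames(rho) != "quality"]), decreasing = TRUE)
```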
Let us now analyze the correlation between the variables for each quality range in figure 4.2. It can be seen that the features are more correlated with each other for lower quality wine. Some strong relationships between variables hold for all quality ranges, such as pH-density, pH-citric acid and pH-fixed acidity, or fixed acidity-volatile acidity and fixed acidity-citric acid.
Figure 4.2: Correlation vs Quality
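Computing a correlation separately for each quality range can be sketched with `split()` and `sapply()`. The data below are made up and the names `wines`, `by_cat` and `rho_by_cat` are assumptions; with the real data each group would contain many more rows.

```r
# Hypothetical toy data with a quality category per wine.
wines <- data.frame(
  fixed.acidity = c(7.4, 7.8, 11.2, 7.9, 6.7, 10.0, 8.5, 5.6, 9.1),
  pH            = c(3.51, 3.20, 3.16, 3.30, 3.52, 3.10, 3.28, 3.60, 3.22),
  quality.cat   = factor(rep(c("bad", "medium", "good"), each = 3),
                         levels = c("bad", "medium", "good"))
)

# Split the rows by category and compute the same correlation in each group.
by_cat <- split(wines, wines$quality.cat)
rho_by_cat <- sapply(by_cat, function(d) cor(d$fixed.acidity, d$pH))
rho_by_cat
```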
From the general correlation matrix (fig. 4.1), three main variables can be selected due to their high correlation with quality: alcohol (\(\rho=0.48\)), volatile.acidity (\(\rho=-0.39\)) and sulphates (\(\rho=0.25\)). To see whether they are suitable for the analysis, let us explore the correlation between them and the distribution of samples in the bivariate plots of figure 4.3.
Figure 4.3: Correlation between alcohol, volatile acidity and sulphates
Figure 4.3 shows a strong relationship between sulphates and volatile acidity, and a weaker one between alcohol and volatile acidity.
Let us now explore the three variables suggested by Cortez et al. (2009) as the most important features for assessing red wine quality: sulphates, pH and total sulfur dioxide.
Figure 4.4: Correlation between total sulfur dioxide, pH and sulphates
Figure 4.4 shows that the strongest relationship is between sulphates and pH (\(\rho=-0.2\)), followed by pH-total sulfur dioxide.
Comparing figures 4.3 and 4.4, it can be concluded that the variables in figure 4.4 (sulphates, pH and total sulfur dioxide), the ones suggested by Cortez et al. (2009), are less correlated with each other and should lead to a more accurate analysis.
This section analyzes the features against the quality ranges. For each variable, a histogram per quality range is plotted, together with a boxplot showing the interquartile range for each quality range and the outliers as dots. In the boxplot, the mean value per quality range is marked with a red asterisk.
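The numbers behind such boxplots (quartiles and mean per quality range) can be obtained with `tapply()`. This is a sketch on made-up values; `wines`, `quartiles` and `means` are hypothetical names. The quartiles from `quantile()` are close to, but not always identical to, the hinges drawn by `boxplot()`.

```r
# Hypothetical toy data, not taken from the wine dataset.
wines <- data.frame(
  alcohol     = c(9.0, 9.4, 9.8, 9.5, 10.0, 10.5, 11.0, 11.8, 12.6),
  quality.cat = factor(rep(c("bad", "medium", "good"), each = 3),
                       levels = c("bad", "medium", "good"))
)

# Quartiles per quality range: roughly what the box of each boxplot spans.
quartiles <- tapply(wines$alcohol, wines$quality.cat, quantile,
                    probs = c(0.25, 0.5, 0.75))

# Mean per quality range: the value marked with an asterisk in the plots.
means <- tapply(wines$alcohol, wines$quality.cat, mean)
means
```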
The next plot (fig. 4.5) suggests that:
the quality of the wine tends to increase with the fixed acidity and citric acid levels
the quality of the wine tends to decrease as volatile acidity increases.
Figure 4.5: Fixed and volatile acidity distributions per quality segment
In the plot of residual sugar (fig. 4.6), outliers further than \(3\sigma\) from the mean have been removed.
Figure 4.6: Residual sugar distribution per quality segment
Figure 4.7: Chlorides and sulfur dioxide distributions per quality segment
The next plot shows that:
generally, the lower the density, the better the quality of the wine.
low pH levels are a sign of better quality.
the higher the sulphates level, the better the quality of the wine.
the higher the alcohol content, the better the wine.
Figure 4.8: Density, pH, sulphates and alcohol distributions per quality segment
Figure 4.9 shows the relationship between the three main variables and the quality of the wine.
Figure 4.9: Sulphates, pH and total sulfur dioxide vs quality
Figure 4.1 shows the correlation between the variables in the dataset. The strongest pairs are: citric acid-fixed acidity, total sulfur dioxide-free sulfur dioxide, density-fixed acidity and pH-fixed acidity.
It is logical that citric acid influences fixed acidity, as free sulfur dioxide does total sulfur dioxide. pH is known to be related to the acid levels, so no surprise here. I did not know there was such a strong relationship between density and fixed acidity.
Figure 4.10 shows the relationship between fixed acidity and density, and between fixed acidity and pH. Fixed acidity and density have a strong positive correlation (\(\rho=0.67\)), and fixed acidity and pH show a strong negative correlation (\(\rho=-0.68\)).
Figure 4.10: Density and pH vs fixed acidity
Figure 5.1: Sulphates, pH and total sulfur dioxide relationship vs quality
Figure 5.2: Fixed acidity vs pH, fixed acidity vs citric acid and volatile acidity vs citric acid per quality range
Figure 5.1 shows the 2D density distribution of the three variables (compared in pairs) that most influence the quality of the wine. It is interesting how the centers seem to shift in a linear manner.
For sulphates-pH there is a negative dependence, consistent with the correlation between these two variables, \(\rho=-0.2\).
For sulphates-total sulfur dioxide there seems to be no dependence, which is consistent with the correlation between these two variables, \(\rho=0.04\).
For pH-total sulfur dioxide there seems to be no dependence, which is consistent with the correlation between these two variables, \(\rho=-0.07\).
Figure 5.2 shows the relationship between the pairs with the highest correlation values in the correlation matrix of figure 4.1. The variable pairs selected were: fixed acidity vs pH, fixed acidity vs citric acid and volatile acidity vs citric acid, all with different colors per quality range.
For fixed acidity vs pH there is a negative dependence, consistent with the correlation between these two variables, \(\rho=-0.68\).
For fixed acidity vs citric acid there seems to be a strong positive dependence, which is consistent with the correlation between these two variables, \(\rho=0.67\).
For volatile acidity vs citric acid there is a moderate negative dependence, consistent with the correlation between these two variables, \(\rho=-0.55\).
I found that there is no significant correlation between pH and total sulfur dioxide. I also found that higher quality wine is usually related to higher levels of citric acid.
I created a multiple linear model with the three main features: sulphates, pH and total sulfur dioxide.
The linear model takes the log10 of each variable. Figure 6.1 shows how the overall error is reduced. However, when looking at the error per quality range (fig. 6.2), the error is higher for extreme values of quality. This makes sense, since the most abundant group is the one with medium quality and hence the model is more accurate for this segment.
Figure 6.1: Error by linear model
Figure 6.2: Error by linear model
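A model of the kind described above could be fitted with `lm()` as sketched below. The data are synthetic and generated only to make the example runnable, so the coefficients and errors say nothing about real wines; the object names and the way quality is simulated are assumptions, not the report's actual code.

```r
# Hypothetical synthetic data: values are generated, not real measurements.
set.seed(1)
n <- 200
wines <- data.frame(
  sulphates            = runif(n, 0.4, 1.2),
  pH                   = runif(n, 2.9, 3.7),
  total.sulfur.dioxide = runif(n, 10, 150)
)
# Simulated integer quality, clamped to the 3-8 range seen in the dataset.
wines$quality <- round(pmin(pmax(
  5.5 + 3 * log10(wines$sulphates) + rnorm(n, sd = 0.8), 3), 8))
wines$quality.cat <- cut(wines$quality, breaks = c(0, 4, 6, 10),
                         labels = c("bad", "medium", "good"))

# Multiple linear model on the log10 of each predictor.
fit <- lm(quality ~ log10(sulphates) + log10(pH) + log10(total.sulfur.dioxide),
          data = wines)

# Error = observed minus fitted quality, summarised overall and per range.
wines$error <- residuals(fit)
mean(abs(wines$error))
tapply(abs(wines$error), wines$quality.cat, mean)
```

Summarising the absolute error per quality range is one way to see whether the extremes are fitted worse than the medium segment.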
The model has the following strengths and limitations:
strengths: simple and limited to the most influential variables.
limitations: not all variables are included, the combination of variables is linear and the quality is assigned by humans with different criteria.
Figure 7.1: Correlation matrix
Figure 7.1 shows the correlation coefficient between the variables in the Red Wines dataset.
Figure 7.2: Sulphates, pH and total sulfur dioxide vs quality
Figure 7.2 shows the distribution per quality value for the three main variables influencing the quality of wine, with a blue line showing the linear model of the quality-feature relation and a red line showing the mean value of the feature per quality value.
Figure 7.3: Citric acidity vs quality
Figure 7.3 shows the distribution of citric acid per quality range and also the box plot per quality range. It can be observed that higher quality wines have higher levels of citric acid than lower quality wines. I really liked this conclusion, and that is why it is included here, because I always thought it would be the opposite: that lower quality wine would be the more acidic.
With this work I have realized how important it is to investigate a dataset before starting to build a model or draw conclusions. Making an Exploratory Data Analysis (EDA) is iterative work: trying to find the best way to explain relationships between variables and the distribution of the data, and then communicating the result in a clear and concise way.
I found that the correlation matrix is crucial for continuous variables, pointing to where a closer look is needed. It is also important to know the data and each variable, and for that, plotting the distributions is important.
Sulphates, pH and total sulfur dioxide are the most important features to explain red wine quality, as suggested by Cortez et al. (2009). However, the correlation matrix shows that alcohol, sulphates and volatile acidity are the most correlated with quality. After exploring the correlation coefficients between them I could see that the variables in the “alcohol, sulphates and volatile acidity” set are more correlated with each other than those in the “sulphates, pH and total sulfur dioxide” set. From the quality exploration it is noticeable that there are no records for wines with quality below 3 or above 8.
This EDA of the Red Wine quality has given me the opportunity to approach a dataset using R and draw conclusions from data. This experience will be helpful in the future when approaching similar problems.
Cortez, Paulo, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis. 2009. “Modeling Wine Preferences by Data Mining from Physicochemical Properties.” Decision Support Systems 47 (4). Elsevier: 547–53.
Medina, Jason. n.d. “P4 Explore and Summarize Data: Red Wine Eda.” https://rpubs.com/jasonmedina/220283.
Thomas, Iwan. n.d. “EDA of Red Wine Quality Dataset.” https://github.com/IwanThomas/Udacity-Data-Analysis-Nanodegree/blob/master/Project-4/RedWineRMD.Rmd.
“Wine Quality Analysis.” n.d. https://rpubs.com/sanmen/24803.